Deploy and Use Open Source GPT Models for RAG
You work as a network engineer at a renowned system integrator company. You are tasked with configuring a broad range of networking devices, from enterprise-level Cisco Catalyst switches to Cisco Nexus 9000 devices in data centers. Keeping track of configuration details across these platforms is challenging. Although you are adept at reading and writing technical documentation, it still takes you a considerable amount of time. You have a subscription to a cloud AI provider that could streamline your searches, but company policy restricts you from uploading any confidential information to the cloud. Instead, you consider deploying a comparable AI solution on-premises yourself.
While searching for an appropriate on-premises solution, you come across various open-source GPT models and chatbot applications capable of managing general IT tasks. A standout discovery is the open-source Open WebUI application, which incorporates an Ollama inference server and offers a user-friendly chat interface. This interface is equipped with advanced features such as Retrieval Augmented Generation (RAG), allowing you to upload files and use them as reference data for the GPT. Remarkably, deploying this application is straightforward, requiring just a simple Docker command. You decide to try Open WebUI.
To proceed, you need a computer—either physical or virtual—with a GPU, which considerably enhances processing speeds. Fortunately, the IT Ops department has provided you with a Linux VM that is equipped with 8 GB of GPU RAM and Docker already configured to use the GPU resources.
To deploy the Open WebUI application, you run the following Docker command that you found in the official documentation:

docker run -d -p 3000:8080 -e WEBUI_AUTH=False --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
This Docker command starts the container in detached mode by using -d, which allows it to operate in the background without occupying the terminal. The command also maps port 3000 on your VM to port 8080 on the container via -p 3000:8080, enabling you to access the application through your VM IP address on port 3000. Authentication is disabled with -e WEBUI_AUTH=False for easier access during initial tests. The --gpus=all option allocates all available GPU resources to the Docker container to help ensure optimal performance.
Further, the command mounts volumes with -v ollama:/root/.ollama and -v open-webui:/app/backend/data for storing the downloaded models and application data, respectively. The container is named open-webui for straightforward management and is set to restart automatically with --restart always should it stop unexpectedly, such as during a reboot. Finally, ghcr.io/open-webui/open-webui:ollama specifies the Docker image, which is pulled from the GitHub Container Registry.
When you press the Enter key, you feel relieved that you have avoided a lengthy and tedious installation and configuration process. After a few seconds, the Open WebUI application is up and running on your VM, providing a robust, in-house solution for your documentation and configuration search needs while remaining compliant with your company's data security policies.
Get Started with Open WebUI and RAG
With the Open WebUI application up and running, you begin searching the web to understand how Retrieval Augmented Generation (RAG) functions. You learn that RAG relies on a technique called semantic search, which identifies relevant context within the files you upload to the RAG system. Unlike traditional methods that look for exact keyword matches, semantic search aims to grasp the intent and contextual meaning behind the words in a query.
A key component that enables semantic search is the embedding process that transforms the text from your queries and potential information sources into numerical vectors. These vectors are essentially lists of numbers that represent text in a high-dimensional space. You can think of these vectors as coordinates on a map where texts with similar meanings have vectors that are closer together. For example, the words "apple" and "banana" will be placed close to each other because both are types of fruit, while the word "keyboard" will be further from the fruits, as it relates to a different context. The same principle applies to phrases or even entire paragraphs.
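To make the fruit-versus-keyboard idea concrete, here is a minimal sketch using made-up three-dimensional vectors (real embedding models output hundreds or thousands of dimensions); cosine similarity is a common way to measure how close two vectors point:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: values near 1.0 mean the vectors point in the
    same direction, which for embeddings means similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; the numbers are invented for illustration.
vectors = {
    "apple":    [0.9, 0.8, 0.1],
    "banana":   [0.8, 0.9, 0.2],
    "keyboard": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["apple"], vectors["banana"]))    # high: both fruits
print(cosine_similarity(vectors["apple"], vectors["keyboard"]))  # low: unrelated
```

With these toy values, the two fruits score close to 1.0 while apple and keyboard score far lower, mirroring how a real embedding model places semantically related text closer together.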
The files you add to the RAG system undergo several processing steps. First, the text is extracted from the files. This text is then divided into smaller sections called chunks. Chunking is necessary because GPT models have a limit on the amount of text, measured in tokens, that they can process at once. This limit is known as the context size of the model, and it varies between different GPT models. The way text is chunked significantly impacts the quality of the answers because each chunk should ideally capture all semantically similar content, but you will explore that in more detail later.
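A minimal sketch of fixed-length chunking with overlap (the 500-character size and 100-character overlap are illustrative values, and the sample text is invented):

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into fixed-length chunks. Overlapping characters carry
    context across chunk boundaries, so a sentence cut in half at the end
    of one chunk still appears whole near the start of the next."""
    chunks = []
    step = chunk_size - overlap  # advance less than chunk_size to overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Sample input: a repetitive block of configuration-like text (~1840 chars).
doc = "interface Ethernet1/1\n  switchport mode trunk\n" * 40
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0]))  # 5 chunks; the first is 500 characters
```

Note that the last 100 characters of each chunk repeat as the first 100 characters of the next, which is exactly the overlap that keeps boundary context intact.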
After chunking, the sections are transformed into vectors using an embedding model, which is a specialized Large Language Model (LLM). These vectors are then stored in a vector database, optimized for efficiently handling high-dimensional data. You can think of the vector database as a lookup table, where each vector serves as a key linked to its corresponding raw text. The entire process is illustrated in the next figure.
When you ask a question using the RAG system, it first uses the embedding model LLM to convert your question into vectors. It then uses these vectors to search the vector database for the information that is most relevant to your query. The text from the best matches is combined with your original question to create a new expanded context. This expanded context, along with your initial query, is sent to the inference LLM in a raw text format. The inference LLM uses this information to provide an accurate answer, enhancing its built-in knowledge with the most relevant data extracted from the database. You see the entire process shown in the next figure. Note that the same embedding LLM is used for both creating the database from the uploaded files and for a semantic search with queries.
Satisfied with your high-level understanding of the RAG pipeline, you begin to explore the Open WebUI application.
Step 1
Type http://localhost:3000 in the URL field and press the Enter key.
Step 2
Focus on the navigation panel on the left.
Step 3
Focus on the main chat interface on the right.
Step 4
Click the Arena Model text in the upper-left corner of the chat interface to open a drop-down menu listing all available inference models.
Step 5
Choose llama3.1:latest from the menu.
Step 6
Click in the chat prompt field with the How can I help you today? text.
Step 7
Click the chat prompt field again and write Tell me a joke.
Step 8
Press the Enter key to continue with the simulation.
Step 9
Click the up-arrow icon or press the Enter key to ignore the suggestion and send the prompt to the GPT.
Step 10
Focus on the various options below the answer.
Step 11
Click the regenerate icon.
Step 12
Notice how this chat is saved in the main navigation panel on the left.
Step 13
Click the ellipsis in the chat entry in the main navigation panel.
Step 14
Click Delete in the menu to delete this chat and clean up the workspace.
Step 15
Click Confirm.
Configure RAG in Open WebUI
Your high-level overview of how RAG works showed that a couple of things need to be configured in the application. First, you have to specify which embedding model you want the application to use and choose the inference model that generates the final answer. Next, you should set the chunking parameters. You remember that an embedding model called mxbai-embed-large is readily available in the environment and that you have already tested the llama3.1 inference model. Finally, you will set the chunking method and put it all together within a chat interface to start testing with some real prompts involving general networking know-how. You delve deeper into the Open WebUI documentation and start configuring.
Step 16
Click the user icon in the bottom-left corner of the main navigation panel.
Step 17
Click Admin Panel in the menu.
Step 18
Click the Settings tab.
Step 19
Click Connections in the navigation panel.
Step 20
Focus on the OpenAI API section.
Step 21
Click the toggle button to disable the OpenAI API.
Step 22
Click the toggle button in the Ollama API pane to enable the Ollama API.
Step 23
Click the wrench icon for the Ollama connection.
Step 24
Click the click_here link in the Pull a model from Ollama.com section to access the Ollama model library.
Step 25
Press the down-arrow key on your keyboard.
Step 26
Click llama3.1 from the list to see additional variants you could use.
Step 27
Click 8b to reveal a drop-down menu with other model variants.
Step 28
Click View all in the menu.
Step 29
Press the Enter key to return to the Manage Ollama page.
Step 30
Click the X icon in the top-right corner of the page to close the Manage Ollama page.
Step 31
Click Models in the navigation panel.
Step 32
Click Documents in the navigation panel.
Step 33
Focus on the General Settings pane.
Step 34
Click Default (SentenceTransformers).
Step 35
Click Ollama in the menu.
Step 36
Type mxbai-embed-large:latest in the Embedding Model field and press the Enter key.
Step 37
Take a look at the Content Extraction settings.
Step 38
Press the down-arrow key on your keyboard.
Step 39
Press the down-arrow key again.
Step 40
Click Save in the bottom-right corner of the page to apply the settings and complete the RAG configuration.
Step 41
Press the Enter key to remove the green notifications.
Write Basic Prompts with RAG
Now that you have configured the parameters used in RAG, you decide to upload the files you want to work with. Your current project revolves around Cisco Nexus 9000 switches, and you want to simplify your task by uploading some configuration guides and using RAG to help you configure the switches. You would also like to upload outputs from show commands, such as show running configuration or show cdp neighbors, and other more specific outputs. You hope that providing this information will give the RAG system a good overview of what is already configured so that it can help you with some additional configurations.
You are a bit worried about the impact of different files on the quality of the answers because configuration syntax and the output from various show commands differ drastically from the language used in the configuration guides that you intend to use. In addition, you do not know whether llama3.1 was trained to recognize and work with Cisco configuration syntax. Also, knowing how chunking works, you have second thoughts about the Chunk Size and Chunk Overlap that you configured and are afraid that the text stored in vector database entries might not capture all the relevant context. Specifically, you are worried about splitting the outputs of show commands in the same way as the configuration guides. You decide to test the current settings using the files as they are and simply see if they do the job. First, you will upload all the files and use Open WebUI's built-in pipeline that processes and stores the files in vector format in the vector database. Next, you will test how well llama3.1 and the RAG settings retrieve data from the vector database.
Step 42
Click Workspace in the main navigation pane.
Step 43
Click Knowledge at the top.
Step 44
Click the + icon to create a new collection.
Step 45
Name the knowledge base Fabric Information and add the following description: Output of various show commands.
Step 46
Double-check that the entered information is the same as in the instructions and click Create Knowledge.
Step 47
Press the Enter key to continue.
Step 48
Click the + icon to begin the upload and embedding process.
Step 49
Click Upload files.
Step 50
Press the Enter key to continue the simulation.
Step 51
Click Open to upload the files.
Step 52
Click Knowledge to see the collections again.
Step 53
Click New Chat in the main navigation panel.
Step 58
Click New Chat again.
Answer
Notice how the model stayed the same, but the prompt suggestions changed as expected.
In the next couple of steps, you will learn how to refer to the documents when prompting. Open WebUI uses the # symbol followed by the collection or file name to associate the prompt with the desired knowledge base.
Step 59
Click in the chat field and type a # symbol. Then press the Enter key to continue.
Step 60
Choose the Nexus 9K Complete Configuration Guide.
Answer
You should see the collection in the chat window.
The following prompt will be associated with the collection containing the PDF of the entire interface configuration guide.
Note
In a real environment, you can choose any collection or specific file in the collections by scrolling through this drop-down menu.
Step 61
Press the down-arrow key on your keyboard to insert the following prompt: Can I configure static MAC addresses on tunnel interfaces?
Answer
You should see the prompt inserted in the chat field.
Note
In a real instance of Open WebUI, you would have to type the prompt yourself. The down-arrow key shortcut is used for your convenience and will be used from now on to insert prompts. Also note that GPT models can handle small grammatical mistakes or typos and still provide accurate answers.
Step 62
Click the arrow icon or press the Enter key to send the prompt.
Answer
You should see the response.
The answer you were looking for was supposed to be more of a "yes" or "no." Looking at the documentation yourself, you find that the answer is a clear "no." The note on page 131 states that you cannot configure static MAC addresses on tunnel interfaces.
You will examine the output and database entries used by llama3.1 while generating the answer to determine why it provided an incorrect response.
Step 63
Press the down-arrow key to get to the end of the response.
Answer
Notice how it mentions the part of the document referring to the words Layer 3 and static MAC addresses. It also asks you to clarify what kind of interface you had in mind.
You will not change the prompt as suggested. First you will explore what kind of entries from the database the system actually used when generating the answer.
Step 64
Click the citation link under the answer.
Answer
You should see the Citation page displaying the entries from the database that were used as context.
The Source field shows the exact file that was used for this citation, with the page number in parentheses. You can see it took the text from page 404.
Next is the Relevance score. The higher the score, the better the semantic match with the prompt. A score of 27.96 % is quite low, yet this entry is the best of the top 5 entries retrieved from the database.
Examining the entry, you can see that it is almost impossible for a human to read. Looking at this text and comparing it to the actual PDF, and using some imagination, you can figure out that it actually took the text from page 405.
You should treat the cited page number as accurate to within plus or minus one page.
Notice how many times the word tunnel appears in the citation. The prompt was asking about tunnel interfaces and the system calculated semantic similarity based on the used words. The repetition of the word tunnel is why the system chose this particular entry as the most relevant context. The system determined that it is statistically most likely to match. Notice also that this text is part of some configuration syntax.
Step 65
Press the down-arrow key to see some other citations.
Answer
You should see the second-best match.
This match has a relevance score very close to the first one's. Again, you can see that it matched the configuration syntax due to the repetition of the word tunnel.
You can optimize the entire RAG process by grouping the data into more meaningful parts yourself. The easiest way is to isolate the most relevant chapter and use it as a collection.
There is another collection that contains only the chapter about Layer 3 interface configuration. You will use this reduced knowledge base to see if the answer improves.
Step 71
Click the up-arrow icon or press the Enter key to send the prompt.
Step 72
Click the citations link.
Answer
You should see the best match.
Notice that the relevance score is 25.74 %, somewhat lower than the roughly 29 % score from the previous prompt. You can also see that this text chunk contains the note that explicitly states you cannot configure a static MAC address on a tunnel interface.
RAG works by retrieving information from a dataset and using it to generate answers. However, if the dataset is too large or contains a lot of irrelevant information (noise), it can confuse the system and lead to poor results. This is why understanding your dataset is so important. If you know your data well, you can create better, more focused prompts that guide the system toward the right information.
It also helps to prepare your data in ways that make it easier for RAG to work with. For example, instead of breaking the data into fixed-sized chunks, you can use semantic chunking. This means splitting the data into smaller pieces based on meaning, so each chunk contains complete and useful information. For example, in a PDF, one chapter might begin discussing VLANs, while later chapters delve into more detailed discussions of VLANs. Semantic chunking helps by grouping these disparate sections into a cohesive and meaningful piece of text. This increases the likelihood of RAG retrieving the most relevant information.
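A sketch of the idea, assuming a document whose topics are marked with "## " headings (the heading convention and the sample text are invented for illustration; real semantic chunkers can also split on embedding-distance shifts between sentences):

```python
def semantic_chunks(document):
    """Split on section headings so each chunk holds one complete topic,
    instead of cutting the text every N characters regardless of meaning."""
    chunks, current = [], []
    for line in document.splitlines():
        if line.startswith("## ") and current:  # heading marks a topic shift
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

# Invented sample document with three topics.
doc = """## Configuring VLANs
Create a VLAN with the vlan command.
## Verifying VLANs
Use show vlan brief to verify.
## Tunnel Interfaces
Static MAC addresses are not supported on tunnel interfaces."""

for chunk in semantic_chunks(doc):
    print(repr(chunk.splitlines()[0]))  # one chunk per topic heading
```

Each resulting chunk contains a heading plus its full explanation, so a query about tunnel interfaces retrieves the complete tunnel section rather than a fragment cut at an arbitrary character position.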
In the following task, you will explore some prompt engineering techniques and refine your queries to get the most out of a RAG system.
Explore Prompt Engineering
Your RAG setup is operational, and you have already learned the common pitfalls and challenges of using RAG, particularly concerning the data used for the database. Now, you want to test how different prompts influence the quality of the answers.
There are many prompt engineering techniques, and you may have used some of them without even realizing it. One of the most basic techniques is called zero-shot prompting, where you simply ask a question without providing any additional instruction, data, or context. This type of prompt is likely to yield poor answers.
A more advanced technique is few-shot prompting, where you include additional context, examples, or specific instructions within the prompt itself. This approach helps GPT provide more accurate and relevant answers by guiding its reasoning process.
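The difference between the two techniques can be shown with two prompt strings; the assistant persona and the worked example below are assumptions added for illustration, while the 1-4094 VLAN ID range comes from the 802.1Q standard:

```python
# Zero-shot prompting: the bare question, no guidance at all.
zero_shot = "Can you configure VLAN 5010?"

# Few-shot prompting: instructions plus a worked example steer the model
# toward validating the VLAN ID before describing any configuration.
few_shot = (
    "You are a Cisco Nexus configuration assistant. "
    "Before answering, check whether the requested VLAN ID is valid "
    "(802.1Q allows IDs 1-4094).\n\n"
    "Example:\n"
    "Q: Can you configure VLAN 5000?\n"
    "A: No. Valid VLAN IDs are 1-4094, so 5000 is out of range.\n\n"
    "Q: Can you configure VLAN 5010?\n"
    "A:"
)
print(few_shot)
```

The few-shot version demonstrates the reasoning pattern you expect, so the model is far more likely to reject the invalid VLAN ID instead of blindly producing configuration steps.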
Regardless of the prompt technique used, you can control how creative a GPT model is when generating answers by adjusting a parameter called temperature. Higher temperature values encourage more imaginative and diverse responses but can decrease accuracy and increase the likelihood of factual errors.
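Under the hood, temperature divides the model's raw token scores (logits) before the softmax that turns them into probabilities. A minimal sketch with made-up logits for three candidate tokens:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/temperature, apply softmax, then sample one token.
    Low temperature sharpens the distribution (predictable, repetitive);
    high temperature flattens it (creative, but more error-prone)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i, probs
    return len(probs) - 1, probs

logits = [2.0, 1.0, 0.1]  # invented raw preferences for three tokens
rng = random.Random(0)
_, cold = sample_with_temperature(logits, 0.1, rng)  # near-greedy
_, hot = sample_with_temperature(logits, 1.5, rng)   # much flatter
print(cold[0], hot[0])  # the top token's probability shrinks as temperature rises
```

At temperature 0.1 the top token takes almost all of the probability mass, while at 1.5 the alternatives become genuinely likely, which is exactly why high-temperature answers vary more and drift into factual errors more often.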
You plan to try both approaches and experiment with how your system responds to different values of the temperature parameter.
Step 75
Click the up-arrow icon or press the Enter key to process the prompt.
Answer
You should see the answer.
It seems the system did not check whether VLAN 5010 is even a valid VLAN ID. It took the prompt verbatim and started to provide step-by-step instructions on how to configure a VLAN, which was 5010 in this case. In other words, asking whether you can do something is not the same as asking whether it is possible to do something. Even so, the system provided a factually incorrect answer. Remember to always fact-check the answers. Your network engineering expertise becomes crucial in this case. GPT and RAG are merely tools that heavily depend on the user; by no means are these tools a replacement for network engineers and their expertise.
First, Cisco Nexus 9000 switches support only a single VDC, so this first part of the answer is irrelevant. The VDC feature is used on Cisco Nexus 7000 switches. Second, the show vdc all command is invalid, since there is no all flag for this command.
Step 77
Press the down-arrow key again.
Answer
You should see the final part of the answer.
The last part tells you how to configure an interface as a trunk and allow this single VLAN on the trunk port. The configuration is correct, but the GPT system did not tell you about the option of configuring an access port with this VLAN.
You will try again, this time using a different prompt.
Step 80
Click the up-arrow icon or press the Enter key to process the prompt.
Answer
You should see the answer.
This time the answer is correct. A small change in the prompt provided you with a completely different answer.
How much the answers to the same or a similar prompt vary is also governed by a parameter called temperature. You can additionally influence the answers by providing instructions in the prompt itself. You will explore the effect of the temperature parameter and of adding instructions in the following steps.
Step 84
Click the chat settings icon.
Step 85
Click temperature in the side-panel.
Answer
You should see the temperature slider set to the default value of 0.8.
The temperature parameter typically ranges from 0 to 1, so 0.8 is considered very high. Higher temperature values allow for more expressive and creative answers, whereas lower values produce more basic and repetitive responses.
Step 86
Click the regenerate button.
Answer
This sends the same prompt and generates a different answer.
The answer is only partially correct. It correctly stated that VLAN ID 5010 is out of range, but notice how it defined the numerical ranges incorrectly. These kinds of numeric mistakes are very common in GPT-generated answers. Always fact-check the answers that you receive.
Higher temperature settings can lead to more elaborate and erroneous answers. You will lower the temperature and regenerate the answer in the following steps.
Step 88
Click the regenerate icon again.
Answer
Step 89
Click the regenerate icon one more time.
Answer
You should get a very similar answer.
Lower temperature settings also provide more similar answers when entering the same or similar prompts.
Now that you have learned the importance of prompt engineering and temperature, you will test the RAG system with the Fabric Information collection containing running configurations of the switches and various outputs from show commands.
Explore Prompt Engineering with RAG
Now you will see how prompt engineering works with RAG. You will use the Fabric Information collection as a knowledge base for the system. This collection contains the output of various show commands. You already uploaded the files in one of the previous tasks of this lab.
The fabric you are working with is a Clos topology with one spine and two leaf switches, named spine01, leaf01, and leaf02. The fabric is configured as a BGP fabric with an EVPN overlay. Other information will be disclosed during the exercise for verification purposes.
Step 94
Click the up-arrow icon or press the Enter key to process the prompt.
Answer
You should see the generated answer.
The retrieved IP for spine01 is correct but the system failed to tell you about the other two devices. Inspect the citations to figure out why it did not produce IPs for all devices and to verify if the retrieved IP is really valid and originates from the documentation.
Step 95
Click the spine01_running_config.txt citation.
Answer
You should see the citation.
You can see that the retrieved IP is indeed from the running configuration. If you checked the citation of leaf02_running_config.txt file, you would see the text in the following figure.
This part of the configuration has no mentions of the management VRF or IP. Interestingly, the relevance score is almost the same. Remember, a RAG system uses your prompt to look for similar entries in the database. You can try to get better results by iterating with the GPT.
Prompt iteration is a technique where you correct the GPT and provide more detailed information to get a better answer. It looks like a dialogue.
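Such an iteration is simply a growing message list. The sketch below builds the request body shape used by Ollama's /api/chat endpoint; the assistant reply and the 192.0.2.x address are invented placeholders, and no request is actually sent:

```python
import json

# Each iteration appends to the same message list, so the model sees the
# full dialogue: your first question, its answer, and your correction.
messages = [
    {"role": "user", "content": "List the management IP addresses."},
    {"role": "assistant",
     "content": "The management IP of spine01 is 192.0.2.11."},  # placeholder reply
    {"role": "user",
     "content": "No, I need you to list all management IP addresses "
                "for leaf01, leaf02 and spine01 devices."},
]

# Request body for Ollama's chat endpoint (POST http://localhost:11434/api/chat);
# shown as JSON only, since this sketch does not send the request.
payload = json.dumps({"model": "llama3.1", "messages": messages, "stream": False})
print(payload[:80])
```

Because the earlier turns stay in the list, the model can interpret "No, I need you to list all..." as a correction of its previous answer rather than as a brand-new question.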
Step 97
Press the down-arrow key to add a prompt below the current response.
Answer
You should see the following prompt added: No, I need you to list all management IP addresses for leaf01, leaf02 and spine01 devices.
Notice how the entire chat interface looks like a messenger app with two people talking.
The goal of this prompt is to tell the GPT the answer was not what you expected and to provide more specific details, this time including the device names. Note that each TXT file is named after the device and that the first line includes information regarding the device name and the used command that produced the output.
Step 98
Click the up-arrow icon or press the Enter key to process the prompt.
Answer
You should see the following answer.
This time you did get three distinct IP addresses. Time to verify whether they really are the management IP addresses. Note that this fabric has a BGP underlay with an EVPN overlay, which means there are several IP addresses configured. Always fact-check the answers from a GPT system, even when using RAG.
Step 99
Click the first citation: leaf02_management_vrf_data.txt
Answer
You should see the output for the management VRF of leaf02.
Notice how the IP address is very close to the mgmt0 keyword. This output was gathered using the following commands: show ip route vrf management, show bgp vrf management all, show ip arp vrf management, show ip interface vrf management. The citation you see here is only one small part of the entire leaf02_management_vrf_data.txt file.
Keep in mind that you were using a fixed-length chunk size of 500 characters with a 100-character overlap. In such cases, it is often beneficial to provide the same information in different formats, for instance, both the running configuration and the dedicated VRF outputs.
Step 104
Click the scrollbar above the prompt field.
Answer
You should see the window move to the right, revealing the other citations in full.
The spine01_management_vrf_data.txt contains the same part of the file, as you can see in the figure.
The output of the show commands for the management VRF turned out to be the best match for this prompt. This showcases the importance of careful prompt engineering and data preparation with RAG systems. What about the other two remaining citations?
Step 105
Click the leaf02_cdp_neighbors.txt citation.
Answer
You should see the output of the show cdp neighbors command for leaf02.
This citation is irrelevant and can be treated as unnecessary noise. The same goes for the leaf01_cdp_neighbors.txt citation. To address this issue, you can improve the process by using semantic chunking techniques or integrating a re-ranker. Semantic chunking ensures that the text is divided into meaningful segments that better align with the context of the prompt. By optimizing your chunking, you can prioritize splitting based on semantic boundaries such as topic shifts or keyword density rather than arbitrary text lengths. This helps isolate the most relevant information while reducing the inclusion of irrelevant noise.
A re-ranker can further refine the results by evaluating and prioritizing the most contextually relevant chunks. By using a more advanced embedding model or a pre-trained re-ranker specialized in contextual understanding, you can filter out citations like leaf01_cdp_neighbors.txt, which do not add meaningful value to the context. This ensures that only the most relevant chunks are passed to the GPT model.
Since GPT processes all matched citations as a single context, minimizing unnecessary or noisy citations is crucial to avoid diluting the model's focus. Combining semantic chunking with a re-ranker can significantly enhance the quality of the input context, leading to more precise and accurate responses tailored to the prompt.
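A sketch of the re-ranking step: the word-overlap scorer below is a crude stand-in for a real cross-encoder re-ranker, and the question, chunks, and 192.0.2.x addresses are invented:

```python
def rerank(question, candidates, scorer, keep=2):
    """Re-score retrieved chunks with a stronger relevance function and keep
    only the best ones, so noisy citations never reach the inference model."""
    scored = sorted(candidates, key=lambda c: scorer(question, c), reverse=True)
    return scored[:keep]

def overlap_scorer(question, chunk):
    """Stand-in scorer: fraction of question words found in the chunk.
    A production re-ranker would use a cross-encoder model instead."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)

question = "what is the management ip address of leaf02"
retrieved = [
    "leaf02 vrf management ip address 192.0.2.12",
    "leaf02 cdp neighbors: spine01 Eth1/1",          # noise from show cdp neighbors
    "spine01 vrf management ip address 192.0.2.11",
]
print(rerank(question, retrieved, overlap_scorer))
```

With keep=2, the CDP-neighbor chunk is dropped before the context is assembled, so the inference model only sees the management-VRF chunks that can actually answer the question.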
Satisfied with the capabilities of your on-premises small-scale RAG deployment, you use all that you learned during this testing to boost your productivity and optimize tedious, mundane, and error-prone tasks, supercharging your day-to-day work.
Step 106
This concludes the simulation.